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SUMARY 


A.  Problem 


This  report  is  addressed  to  the  problem  of  developing  procedures  for 
clustering  individuals  or  objects  into  similar  "types."  Such  procedures 
could  be  useful  in  producing  an  objective  analysis  and  revision  of  the 
Navy  rating  structure  by  classifying  positions  with  similar  patterns  of 
skill  requirements  into  the  same  occupational  category. 

B.  Background  and  Requirements 

1.  Cluster  analysis  methods  of  various  kinds  have  been  employed  in  the 
study  of  individual  differences,  in  the  taxonomy  of  biological  organisms, 
in  the  classification  of  documents  for  information  retrieval,  in  the  study 
of  Navy  enlisted  basic  skill  patterns,  and  in  unsupervised  pattern  recog¬ 
nition  of  electronic  signal  patterns. 

2.  Recent  interest  in  this  area  has  been  stimulated  by  the  advent  of 
high-speed  digital  computers  capable  of  carrying  out  cluster  analysis 
automatically.  However,  most  existing  methods  contain  certain  arbitrary 
factors  or  assumptions  which  are  difficult  to  justify  statistically.  Since 
different  methods  can  give  different  results,  there  is  a  clear  requirement 
for  the  development  of  a  technique  which  is  rigorously  derived  from 
statistical  theory.  Such  a  technique  was  presented  by  Wolfe  (1965,1967) 
for  the  special  case  of  mixtures  of  multivariate  normal  distributions.  The 
present  report  generalizes  these  methods  to  other  distributions. 

C.  Approach 

The  approach  involves  reformulating  cluster  analysis  as  a  problem  in 
the  estimation  of  the  parameters  of  a  mixture  of  distributions.  Maximum- 
likelihood  (ML)  methods  are  used  exclusively  because  of  the  ease  with 
which  they  can  be  generalized  to  multivariate  distributions  of  various 
forms. 

D.  Findings 

1.  Regardless  of  the  shape  of  the  distribution  the  maximum- likelihood 
estimate  of  the  proportion  of  a  mixture  from  a  given  type  is  equal  to  the 
sample  mean  of  the  probability  of  members'. ip  of  the  objects  in  that  type. 
The  equations  for  the  maximum- likelihood  estimates  of  the  parameters  of  a 
mixture  are  the  weighted  averages  of  the  expressions  used  in  obtaining  ML 
estimates  for  pure  types,  where  the  weights  are  the  probabilities  of 
membership. 

2.  The  estimation  procedures  for  normal  mixtures  with  unequal  co- 
variances,  normal  mixtures  with  equal  covariances,  and  mixtures  of  latent 
classes  are  derived  as  special  cases  of  the  general  theory. 
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3.  Various  iteration  techniques  are  discussed  for  obtaining  numerical 
solutions  to  mixture  problems. 

4.  The  examples  of  the  results  of  two  computer  mixture  analysis 
programs  (NORMIX  and  NORMAP)  indicate  that  the  theory  is  sound  for  large 
samples  and  that  the  procedures  given  in  this  paper  are  practical. 

E.  Conclusion 


A  practical  and  statistically  rigorous  method  of  cluster  analysis  has 
been  developed. 

F.  Recommendations 


1.  It  is  recommended  that  the  computer  programs  NORMIX  and  NORMAP  be 
used  in  Naval  research  studies  requiring  a  cluster  analysis  of  continuous 
measurement  patterns. 

2.  Further  development  of  computer  programs  for  clustering  discrete 
data  patterns  is  desirable. 
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PATTERN  CLUSTERING  BY  MULTIVARIATE  MIXTURE  ANALYSIS 


1 .  INTRODUCTION 

This  paper  is  addressed  to  the  problem  which  has  been  variously 
called  cluster  analysis,  Q-analysis,  typology,  grouping,  clumping,  class¬ 
ification,  numerical  taxonomy,  and  unsupervised  pattern  recognition.  The 
variety  of  nomenclature  may  be  due  to  the  importance  of  the  subject  in 
such  diverse  fields  as  psychology,  biology,  signal  detection,  artificial 
intelligence,  and  information  retrieval.  Perhaps  this  multiplicity  of 
names  also  indicates  a  certain  confusion  in  the  basic  definition  of  the 
problem.  This  paper  attempts  to  clarify  the  formulation  of  the  problem, 
with  a  resulting  improvement  in  conceptual  simplicity  and  statistical 
rigor. 

In  classification  methodology,  one  is  generally  given  a  sample  of  N 
objects  or  individuals,  each  of  which  is  measured  on  m  variables.  From 
this  information  alone,  one  must  devise  a  classification  scheme  for 
grouping  the  objects  into  r  classes.  The  number  of  classes  and  the 
characteristics  of  the  classes  are  to  be  determined.  If  all  the  objects 
in  a  given  class  were  identical  to  one  another,  the  problem  would  be 
simple.  However,  in  the  usual  situation  the  objects  in  a  class  differ  on 
most  or  all  of  the  measures.  Most  cluster  analysis  procedures  try  to 
measure  the  "similarity"  of  objects  within  a  class,  and  then  try  to  group 
the  objects  so  as  to  maximize  within-class  similarity.  Unfortunately,  the 
appropriate  measure  of  similarity  is  a  subject  of  some  controversy.  It 


would  be  desirable  to  derive  a  cluster  analysis  system  without  arbitrary 
assumptions  about  similarity.  Such  a  system  will  be  presented  in  this 
paper. 

Since  the  objects  within  a  class  differ  from  one  another,  it  is 
reasonable  to  assume  the  existence  of  a  probability  distribution  of 
characteristics  for  a  population  belonging  to  this  class.  Elements  of 
a  different  class  will  have  a  different  probability  distribution  of 
characteristics.  The  combined  population  taken  from  all  classes  will 
have  a  probability  distribution  which  is  a  mixture  of  distributions. 

The  problem  is  to  io’entify  and  describe  the  component  distributions  from 
a  sample  drawn  from  the  mixture.  Before  it  is  possible  to  solve  this 
problem,  some  assumptions  must  be  made  about  the  forms  of  the  component 
distributions.  For  example,  the  component  distributions  are  usually 
assumed  to  be  unimodal.  The  purpose  of  classification  methodology  is  to 
take  a  complicated  multi-modal  distribution  and  analyze  it  into  simple 
familiar  components.  Therefore,  the  component  distributions  can  usually 
be  assumed  to  be  standard  statistical  distributions  with  unknown  para¬ 
meters.  The  classification  problem  can  then  be  solved  by  standard 
statistical  techniques  of  parametric  estimation.  This  is  the  approach 
taken  in  the  present  paper. 

Over  70  years  ago  Karl  Pearson  (1894)  used  the  method  of  moments  to 
estimate  the  parameters  of  a  mixture  of  two  univariate  normal  distributions. 
Maximum- likelihood  methods  for  a  special  case  of  the  same  problem  were 
presented  by  Rao  (1952).  Studies  of  mixtures  of  univariate  discrete  dis¬ 
tributions  have  been  reviewed  by  Blischke  (1963).  Maximum-likelihood  (ML) 


estimation  procedures  for  mixtures  of  multivariate  normal  distributions 
were  presented  by  Wolfe  (1965,1967).  Similar  ML  estimation  methods  were 
presented  by  Hasselblad  (1966)  for  univariate  normals,  and  Cohen  (1967) 
developed  simplified  moment  estimators  for  the  univariate  normal  case. 
Moment  estimators  for  various  special  cases  of  mixtures  of  multivariate 
normals  have  been  presented  by  Cooper  (1967).  Stanat  (1968)  and  Sammon 
(1968)  developed  multivariate  generalizations  of  Medgyessy's  (1961)  methods 
for  estimating  the  parameters  of  a  mixture  from  a  Fourier  approximation 
to  the  sample  distribution. 

Lazarsfeld's  "Latent  Structure  Analysis"  (1959)  is  closely  related 

to  the  mixture  analysis  problem.  In  "Latent  Class  Analysis,"  the  observed 

contingencies  among  several  dichotomous  variables  are  explained  by 

assuming  the  population  is  a  mixture  of  "latent  classes"  within  each  of 
% 

which  the  variables  are  independently  distributed.  Gibson  (1959)  succeed¬ 
ed  in  generalizing  Lazarsfeld's  model  to  mixtures  of  spherical  multi¬ 
variate  normal  distributions. 

This  paper  summarizes  this  author's  previous  work  on  mixtures  of 
multivariate  normal  distributions  and  generalizes  the  theory  to  mixtures  of 
multivariate  distributions  of  almost  any  given  form.  The  approach  is  exclu 
sively  that  of  maximum-likelihood  estimation.  Although  other  procedures 
are  valuable  in  special  cases,  maximum- likelihood  methods  seem  to  be  the 
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easiest  to  generalize  and  the  most  efficient  in  the  way  they  use  the  in¬ 
formation  in  the  sample.  They  probably  have  been  overlooked  in  the  past 
because  of  the  amount  of  computation  required  to  solve  the  ML  equations 
numerically.  Computation  costs  are  no  longer  prohibitive  w i • modern 
electronic  computers,  as  wc  shall  illustrate  by  presenting  the  ML  solution 
obtained  in  one  minute  to  the  classic  Fisher  Iris  mixture  problem. 

2.  GENERAL  MIXTURE  ANALYSIS 


Let  Oj(x,0j),  a^fx.S^),  .  .  .  ,  a  (x,0  )  be  r  probability  distri¬ 
butions  defined  on  an  m-dimensional  space  of  random  vectors: 

x  =  (XlfX2,  •  •  •  ,  XJ  . 

Assume  each  is  a  twice  differentiable  function  of  its  parameters, 

e  =  (e  , ,o  ,  o  ). 

s  1  si  s2  sq 

Suppose  a  mixture  of  distributions  is  formed  by  taking  proportions 
{ A s }  of  the  population  from  types  (a^).  The  probability  distribution 
of  the  mixture  is  given  by 


f(x)  =  l 


s=l 


A  a  (x ,0  )  , 
s  s  s 


(2.1) 


r 

where  E  X  =  1  (2.2) 

is 

s  =  l 


The  "probability  of  membership"  of  a  vector  x  in  type  s  can  be 
defined  as 


P(s  x) 


„  ,  n  ,  |  .  A  O  (X  ,  0  ) 
P  (s)  •  P  (x  I  s )  _  s  s 

P(x)  f (X) 


(2.3) 


1 


Suppose  a  sample  of  N  random  vectors  is  drawn  from  the  mixture. 

The  k^  random  vector  is  represented  by 

xk  =  (Xlk’A2k’  *  ‘  *  *  Xmk)  * 

The  maximum- likel ihood  estimates  of  the  parameters  are  those  values 
of  which  maximize  the  likelihood  of  the  sample, 

N 

log  L  =  Z  log  f(x  ), 
k=l  K 

r 

subject  to  the  constraint  Z  A  =  1 

.  s 

S=1 

Using  a  Lagrangian  multiplier  gj,  we  form  the  function 

N  r 

log  L'  =  Z  log  f(x,  )  -  u»(E  X  -1)  .  (2.4) 

k-  1  K  s=l  S 


The  necessary  equations  for  maximum  likelihood  are  obtained  by 
setting  the  derivatives  of  log  L'  to  zero  as  follows: 


9  log  L' 

9  A 


N 

=  Z 
k=l 


1 


f(x.)  as(V  w  "  0  * 


(2.5) 


si 


N  A  9(a) 
s  s 

=  Z  — - —  - -  =  n 

I  .  f(xj  96  . 
k=l  k  si 


(2.6) 


Multiplying  (2.5)  by  A  ,  and  substituting  from  (2.3),  we  have 


N 

X  9  l0£  l'  =  E  P (s  | x,  )  -  wA  =0  . 

S  9 A  k=l 

s 


(2.7) 
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Summing  across  s  and  using  (2.1)  and  (2.2),  wc  find  that  u>=N. 

After  a  little  algebra  on  equation  (2.7),  we  obtain  the  following: 

Theorem  1  The  maximum- 1 i kel ihood  estimate  of  the  proportion  of  a  mixture 
from  a  given  type  is  equal  to  the  sample  mean  of  the  probability  of 
membership  of  the  objects  in  that  type;  and  the  likelihood  equation  is 
given  by: 

1  N 

A  =  J7  Z  l’(s  |x.  )  .  (2.8) 

S  *  k  =  l  K 


The  concept  of  probability  of  membership  also  helps  to  clarify 
equation  (2.6).  When  substitution  is  made  in  (2.6)  from  (2.3),  the  result 
is  the  following: 

Theorem  2  The  equations  for  the  max imum- 1  ike  1 ihood  estimates  of  the 
parameters  of  the  distributions  comprising  a  mixture  are  given  by: 


30 


log 


s  i 


k=l 


*  l 


(2.9) 


If  the  entire  population  were  drawn  from  one  type,  n  ,  the  equations 

for  the  maximum- 1 ikcl ihood  estimates  of  0  would  be 

s 


N  i  log  a 

T.  — — - -  =  0  .  Thus  the  equations  for  the  ML 

k=  1  si 

estimates  of  the  parameters  of  a  mixture  arc  the  weighted  averages  of  the 
expressions  usrj  in  obtaining  ML  estimates  for  pure  types,  where  the 
weights  are  the  probabilities  of  membership. 


Usually  the  number  of  types,  r,  is  only  a  hypothesis  which  can  he 
tested  against  an  alternative  hypothesis  of  r'  types  by  finding  maximum- 


likelihood  estimates  under  both  hypotheses  and  testing  the  likelihood 
ratio  by  the  formula  x2  =  -2  log  (L^/L^)  with  degrees  of  freedom  equal 
to  the  difference  in  the  number  of  parameters  estimated.  Likelihood 
ratio  tests  may  also  be  used  to  test  alternative  hypotheses  concerning 
the  forms  of  the  component  distributions.  Of  course  the  distribution 
of  the  logarithm  of  the  likelihood  ratio  is  approximately  chi-square  for 
large  samples  only. 

In  most  cases,  equations  (2.8)  and  (2.9)  will  have  to  be  solved 
numerically.  For  this  purpose  and  also  for  the  purpose  of  obtaining 
confidence  intervals,  it  is  desirable  to  have  some  approximation  to  the 
information  matrix: 


I  = 


I 

I 

AA 

A  0 

*0A 

*00 

(2.10) 


vherc  I  is  partitioned  into  the  sub-matrices  I ^ , 1 ,  defined  as 


A  A 


AG 


3  log  L 


3  A 

3A  )j 

s 

P  '  1 

’  3_ 

log  L 

8  log  L  \  ) 

3  A 

30  .  /  f 

s 

p.l  '  1 

3 

log  L 

3  log  L  \  ) 

30  , 

30  .  /  f 

S  1 

X 

‘ex  ,s 

the  transpose 

S  P 


*(s|x)  P 


(p|x))  } 


(2.11) 


8  log  a 

>(p|x)  - E 


30  .  ')}(2-12) 

PJ  I  ’ 


3  log  as  3  log  aiw2il3) 


30  . 

PJ 


l)) 


‘AO’  — - -  - - 

ellipsoids  can  be  developed  with  the  help  of  the  inverse  of  the  infor¬ 
mation  matrix 


V  =  I 


-1 


(2.14) 
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which  gives  the  large-sample  dispersions  of  the  ML  estimators.  The  ML 
equations  (2.8)  and  (2.9)  can  be  solved  iteratively  by  the  "method  of 
scoring"  (Kale,  1962)  using  the  following  equations: 


lAAs  ) 


(2.15) 


where  { AA  }  and  (A0  .  }  are  column  vectors  for  the  increments  in  the  estimates 
s  si 

used  in  the  next  numerical  iteration.  . 

In  most  cases  the  expectations  in  the  information  matrix  involve 
integrals  which  are  impossible  to  evaluate  in  closed  form  and  difficult 
to  approximate  by  series.  The  usual  approach  has  been  to  estimate  the 
information  matrix  from  the  sample,  replacing  the  expectation  symbol  L 

1  N 

in  (2.11-2.13)  by  —  I  .  This  approach  is  satisfactory  if  the  infor- 


k=  1 


mation  matrix  is  to  be  calculated  only  once  for  the  purpose  of  obtaining 
confidence  regions.  It  is  prohibitively  expensive  if  I  has  to  be  re- 
estimated  many  times  during  the  iteration  (2.13).  Some  alternative 
iteration  techniques  will  be  developed  in  the  next  section. 


3.  ITERATION  METHODS 

'  o 

Fortunately  certain  approximations  are  often  possible  when  solving 
mixture  problems  numerically.  First  let  us  consider  a  limiting  case. 

When  the  component  types  of  a  mixture  are  widely  separated,  each  point 
will  have  a  probability  of  membership  close  to  unity  for  one  of  the  types 
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and  nearly  zero  for  the  other  types.  In  other  words,  the  probabilities 
of  membership  come  close  to  defining  a  partition  of  the  sample  points  into 
discrete  clusters.  The  product  of  two  probabilities  of  membership  can  be 
then  approximated  by 


P(s|x)P(p|x)  ^  6  P(s|x), 

pb 


(3.1) 


where  6  is  the  Kronecker  delta. 

>  r  fins  approximation  is  inserted  into  the  information  matrix  the 


I  %  N  U 

AA  I  Xs 


!>e  "  0 

(  /  ^  log  3  log  a 

Tee  '  N  V  H-W-*  3e~~^  pls  IX1 

(  K  \  SI  Sj 


The  information  matrix  for  the  mixing  proportions  is  seen  to  be 
approximately  diagonal,  and  since  I  'vO  >  the  iteration  for  the  mixing 
proportions  can  be  carried  out  independently  of  the  iteration  for  the 

other  parameters.  The  approximation  for  I  is  seen  to  be 

00 

I  „  -v  {N  5  I  }, 

00  ps  s 

where  Ig  is  the  information  matrix  of  the  parameters  (0  p  for  a  single 
observation  from  a  pure  distribution,  a  .  Thus,  if  the  distributions  do 
not  overlap  very  much,  the  iterations  for  the  parameters  of  one  type  do 
not  involve  terms  from  the  other  types. 


Equation  (2. IS)  then  reduces  to  the  two  approximations: 


N 

Y 

k  =  l 


s 


P(s|xk)  -  As  . 

1  N 

rr  l  i’(s|x  ) 
k=l 


9  log  as 


90  . 


The  first  equation  can  he  put  in  the  form 

1  N 

A'  =  A  +  AA  =  -  T.  P(s|x .), 
s  s  s  X  k=1  K 


(3.2) 

(3.3) 


where  A'  is  the  estimate  of  Ag  on  the  next  iteration.  This  is  precisely 
the  same  form  as  (2.8),  indicating  that  a  good  iteration  technique  is 
simply  to  calculate  the  right  hand  side  of  (2.8)  with  old  estimates  of  the 
parameters  and  thus  obtain  improved  estimates  on  the  left  side.  For  certain 
distributions  (3.3)  also  becomes  extremely  simple. 

The  approximations  introduced  from  (3.1)  do  not  affect  the  final 
solution  of  the  likelihood  equations,  they  only  affect  the  rate  of 
convergence  (or  lack  of  it)  in  the  iteration  process.  I.ven  when  there 
is  an  appreciable  amount  of  overlap,  the  information  matrix  often  is 
dominated  by  its  diagonal  to  a  sufficient  degree  that  the  simplified 
iteration  methods  remain  useful. 

The  preceding  results  have  been  derived  for  an  extreme  case;  let  us 
see  what  can  be  derived  under  a  less  restrictive  assumption.  The  fact 
that  many  of  the  off-diagonal  elements  of  I  tend  to  vanish  in  the  extreme 
case  of  non-overlapping  distributions  makes  it  plausible  to  assume  that 
in  many  cases  of  interest,  the  diagonal  will  dominate  the  matrix,  i.c. 
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It  is  helpful  to  place  some  minimum  value  (such  as  .05)  on  the  values 


that  D  ^  and  D  are  allowed  to  atta'n,  and  to  prevent  D  from  taking  on  a 
value  so  small  that  some  X” ’  becomes  negative  or  greater  than  one. 

The  above  methods  all  require  initial  estimates  to  lie  within  some 
radius  of  convergence  of  the  true  maximum- 1 ikel ihood  values;  otherwise 
iterations  will  diverge.  The  methods  discussed  represent  only  a  few  of 
a  large  number  of  possible  numerical  techniques;  it  seems  quite  possible 
that  better  methods  for  mixture  problems  will  be  developed. 

4.  MULTI  PI. Li  SOLUTIONS  AND  INITIAL  1  .ST  I  MATHS 
The  likelihood  equations  defined  by  Theorems  1  and  4  nave  a  large 
number  of  solutions  corresponding  to  absolute  maxima,  relative  maxima, 
saddle  points  and  minima  of  the  likelihood  function.  It  is  easy  to  see 
that  the  re  are  at  least  r!  absolute  maxima,  because  if  T  is  any  permutation 
of  the  integers  1,  2.  .  .  .  ,  r  and  ('.,0.}  is  an  absolute  max imum- 1 ikel  i- 
hood  solution,  then  another  absolute  maximum  is  defined  by  the  mapping 


A  ‘  V,  l 

s  T(s)  , 


0.  -  0...  . 

s  1  (s) 


Between  the  absolute  maxima  will  be  relative  minima  and  saddle  points, 
which  also  satisfy  the  likelihood  equations.  Lor  example,  if  an  absolute 
max i mum- 1 i ke 1 ihood  solution  is  available  for  r- 1  types,  then  r- 1  families 
of  saddle -ridge  solutions  for  r  types  can  be  generated  by  setting  the 
parameters  of  the  r^'  type  equal  to  those  of  one  of  the  other  types.  If 
two  types  have  identical  parameters  (0  =0^.),  then  the  likelihood  remains 
unchanged  as  long  as  A  +A  =constant.  Additional  solutions  can  be  generated 


by  restricting  three  of  the  types  to  have  identical  parameters,  then  four 
of  the  types,  then  two  sets  of  two  types,  and  so  forth.  These  degenerate 
solutions  are  easily  spotted,  and  it  is  not  difficult  to  program  a  computer 
to  avoid  them  as  well  as  the  relative  minimum  likelihood  solutions. 

There  are  many  other  relative  maxima  which  are  more  difficult  to 
identify  as  such.  If  an  absolute  maximum- 1 ikel ihood  solution  is  available 
for  r  types,  then  one  way  of  generating  initial  estimates  for  an  r-1  type 
solution  is  to  combine  two  of  the  clusters  into  a  larger  one  with  greater 
dispersions  and  a  centroid  somewhere  between  the  two  component  clusters. 

There  are  (^)  ways  of  selecting  two  out  of  r  clusters.  Not  all  of  these 
different  initial  estimates  will  result  in  different  solutions  after 
iteration,  but  experience  has  shown  that  some  of  them  will.  Given  two  or 
more  solutions,  it  is  always  possible  to  choose  the  "best0  by  selecting 
the  one  with  the  largest  likelihood.  Unfortunately,  there  is  no  algorithm 
for  generating  all  possible  solutions  or  for  proving  that  a  given  solution 
is  an  absolute  maximum. 

Most  other  cluster-seeking  algorithms  suffer  from  the  same  difficulty: 
they  can  converge  on  a  sub-optimal  solution.  Ball  (1967)  and  Friedman 
and  Rubin  (1967)  report  they  they  can  obtain  different  solutions  depending 
on  the  initial  estimates.  The  practical  answer  is  to  try  a  variety  of 
initial  estimates  on  any  given  problem  and  to  use  various  heuristics  for 
generating  plausible  initial  estimates.  Some  of  these  heuristics  are  Ball 
and  Hall's  "cluster  splitting,"  and  "cluster  lumping,"  Friedman  and  Rubin's 
"forcing  passes,"  Forgy's  "reassignment  passes  (1965),"  and  random  partition¬ 
ing.  A  strong  case  can  be  made  for  including  a  human  being  in  the  system 
for  generating  initial  estimates.  Ball  and  Mall's  I’ROMF.NADF  (1967)  uses 


CRT  display  consoles  to  produce  on-line  graphical  representation  of 
sample  scatter.  The  console  operator  can  rotate  the  display  through 
several  dimensions  and  can  select  initial  estimates  by  pointing  to  trial 
cluster  centroids  with  a  pointing  device. 

The  generation  of  initial  estimates  is  a  whole  field  in  itself.  We 
will  not  pursue  it  further  in  this  paper  but  will  merely  assume  that 
initial  estimates  for  the  iterative  maximum-likelihood  procedures  have 
been  supplied  by  some  other  source.  All  of  the  various  clustering 
techniques  developed  by  other  investigators  are  valuable  potential  sources 
of  initial  estimates. 

5.  NORMAL  MIXTURE  ANALYSIS 

To  illustrate  the  general  principles  of  ML  estimation  for  mixtures, 
let  us  consider  the  case  of  mixtures  of  multivariate  normal  distributions. 

m  1_ 

Let  as(x,us,oS)  =  (2Tr)  2  | as  | 2  exp  {-  i(x-pg)  'og (x-us)  } ,  (5.11 

where  aS=  { cjS  .  >  and  o  =oS  ={o^}. 

i  J  3  O 

The  derivatives  are  given  as  follows: 
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log  as 


m 

l 

j  =  l 


(Xj-V 


(5.2) 


3  log  a 


3a 


1J 


and 


(5.3) 


Substituting  the  derivatives  in  (2.9)  and  doing  a  little  algebra, 
we  find  the  following: 

Theorem  3  The  likelihood  equations  for  the  parameters  of  a  mixture  of 
multivariate  normal  distributions  with  unequal  covariance  matrices  are 
equivalent  to  the  following  equations: 

-  s  \ml  p<slxk)  (5-4> 


NA  k=l 
s 


P (s | xk) x 


ik 


(5.5) 


-J-! 

NA  k= 


P(S  l*k)  (Xi k_us i )  (Xjk'llsj, 


(5.6) 


Thus  the  equations  for  estimating  the  parameters  of  a  mixture  of  normal 
distributions  are  closely  analogous  to  the  equations  for  estimating  the 
parameters  of  a  pure  type  except  that  each  sample  point  is  weighted  by 
its  probability  of  membership. 

5 

When  the  types  have  a  common  covariance  matrix  o,  so  that  o  =o  for 
l<s<r,  then  equation  (2.9)  is  no  longer  valid,  since  it  was  derived  under 
the  assumption  that  the  parameters  of  different  types  were  not  functionally 
related.  The  correct  likelihood  equation  is 
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These  results  are  summarized  in  the  following: 


Theorem  4  The  likelihood  equations  for  the  parameters  of  a  mixture  of 
multivariate  normal  distributions  with  a  common  covariance  matrix  are 
equivalent  to  equations  (5.4),  (5.5)  and  the  following: 


1 


N 


o.  .  =  -  E  X..  X..  -  Z  A  u  .  p  • 
1 J  N  k=1  ik  jk  c  .,  s  si  sj 


(5.7) 


s- 1 


The  equations  of  Theorems  3  and  4  define  a  simplified  iteration 
method  which  is  optimal  in  the  limiting  case  of  very  widely  separated 
types  and  which  can  be  accelerated  in  many  other  cases  by  applying 
equations  (3.6)  and  (3.8)  to  the  mixing  proportions  and  the  means.  Two 
computer  programs  have  been  written  to  implement  this  procedure:  the 
program  for  Theorem  3  is  called  NORMIX  and  the  program  for  Theorem  4  is 
called  NORMAP.  Some  results  of  the  programs  will  be  presented  in 
section  7. 


6.  LATENT  CLASS  ANALYSIS 

The  general  principles  of  mixture  analysis  given  in  Theorems  1  and  2 
are  applicable  to  a  wide  variety  of  distributions.  The  first  application 
involved  continuous  distributions.  Next,  let  us  consider  the  quite 
different  discrete  distribution  employed  in  Lazarsfeld's  "Latent  Class" 
model  for  attitude  item  responses:^ 
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See  Anderson  (1959)  for  a  discussion  of  ML  estimation  and  Stanat  (1968) 


for  Fourier  transform  estimation  procedures. 


m  X.  1-X. 

as(x>M <J  =  n  Fc,  C 1  - M _ i  J  1,  where  X.=()  or  1. 

o  .SI  SI  X 


(6.1) 


9  log  a 


9p. 


si 


X.-p  . 

1  si 

P  . (1-p  . . 
SI  si) 


(6.2) 


Substituting  this  derivative  in  (2.9),  we  obtain 


-NX  ,  N 

-~£ —  =  - - - ,[p  rjf-  Z  P  (s  |  X.  )  X  . ,  ]  =  0. 

3"si  usi  _usi  *  51  “s  k=l  k  lk 


(6.3) 


Setting  the  term  in  brackets  to  zero  gives  us  the  following  remarkable 
fact . 

Theorem  5  The  maximum- 1 ikelihood  estimates  for  the  parameters  of  a 
mixture  of  latent  classes  are  solutions  to  equations  (5.4)  and  (5.5). 

Thus,  the  estimation  equations  for  this  discrete  distribution  are 
formally  identical  to  those  for  a  multivariate  normal  distribution,  the 
only  difference  being  that  P(s|xjJ  is  calculated  with  (6.1)  instead  of 
(5.1). 

7.  EXAMPLES  OF  COMPUTFR  ANALYSES 

This  section  presents  some  results  from  computer  runs  with  programs 
NORMAP  (normal  mixtures  with  common  covariance  matrix)  and  NORMIX  (normal 
mixtures  with  different  covariances  matrices).-  Both  programs  are  written 
for  flexible  input  so  as  to  lie  suitable  for  general  use  in  a  statistical 
1 ibrary . 


-Both  programs  are  written  in  FORTRAN  63  for  the  CDC  1604  and  will 
shortly  be  available  from  CO-OP,  Control  Data  Corporation,  3145  Porter 
Drive,  Palo  Alto,  California  94304. 
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The  programs  are  currently  limited  to  10  variables,  1000  individuals, 
and  20  types.  NORMAP  is  so  named  because  it  produces  a  "map"  of  the 
data  by  generating  a  printer-plot  of  the  sample  points  in  discriminant- 
space.  In  each  example  to  be  reported,  NORMAP  required  about  one  minute 
to  estimate  parameters  under  four  hypotheses  concerning  the  number  of 
types,  while  NORMIX  required  as  much  as  nine  minutes.  Several  runs  were 
made  with  different  initial  estimates.  In  those  cases  where  multiple 
solutions  were  obtained,  only  the  greatest  likelihood  results  were  used 
in  the  analysis. 

7.1  Iris  data 

The  classic  Iris  data  published  by  Fisher  (1936)  have  been  used  by 
Kendall  (1965)  and  Friedman  and  Rubin  (1967)  for  illustrating  their 
cluster  analysis  methods.  Of  course  Fisher  knew  the  correct  classifi¬ 
cations  of  each  of  the  150  irises  in  his  sample,  but  the  cluster  analysis 
studies  and  our  mixture  analyses  attempt  to  discover  and  describe  the 
types  of  irises  without  using  any  a  priori  classification  information. 

NORMAP  and  NORMIX  were  run  on  the  data  using  hypotheses  of  one  type, 
two  types,  three  types,  and  four  types.  Each  hypothesis  was  tested 
against  the  previous  one  with  x2  =  -2  log(Lr  ^/L  )  anc*  degrees  of  freedom 
equal  to  the  difference  in  the  number  of  parameters  in  the  two  hypotheses. 
The  results  are  given  in  Table  1. 
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The  significance  tests  definitely  indicate  that  the  hypothesis  of 
one  type  should  be  rejected  against  the  alternative  of  two  types,  and 
that  the  hypothesis  of  two  types  should  be  rejected  against  the  alternative 
of  three  types.  It  is  not  quite  so  clear  whether  the  hypothesis  of  three 
types  should  be  rejected  against  the  alternative  of  four  types.  Perhaps 
the  obtained  values  of  four  types  are  due  to  a  slight  skewness  in  the 
component  distributions,  or  perhaps  the  sample  size  is  so  small  that  the 
distribution  of  the  log  likelihood  ratio  differs  significantly  from  its 
asymptotic  chi-square  distribution.  The  results  seem  to  show  room  for 
improvement  in  the  hypothesis-testing  part  of  the  analysis. 

Table  2  presents  the  maximum- likel ihood  estimates  of  some  of  the 
parameters  obtained  by  NORMAP  and  NORMIX  for  three  types  and  compares  them 
with  the  estimates  obtained  by  Fisher  when  he  used  a  priori  knowledge  of 
the  correct  classifications.  The  row  labeled  "number  misclassificd"  gives 
the  number  of  sample  points  whose  highest  probabilities  of  membership 
occurred  in  types  different  from  their  actual  species.  Both  NORMIX  and 
NORMAP  gave  estimates  very  close  to  the  Fisher  values,  with  NORMAP  slightly 
better  than  NORMIX  in  most  cases.  The  three  flowers  "misclassified"  by 
NORMAP  were  identified  as  numbers  71,  84,  and  134.  These  are  precisely 
the  same  plants  which  Friedman  and  Rubin  (1967)  misclassificd  by  their 
| T | / | W |  maximization  procedure.  This  is  not  surprising,  because  NORMAP 
could  be  considered  to  be  a  continuous  version  of  the  discrete  partition¬ 
ing  procedure  of  Friedman  and  Rubin.  The  two  methods  tend  to  coincide  in 
the  limiting  case  of  widely  separated  types. 
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TABLE  2.  FISHER  IRIS  PARAMETERS 


7.2  Artificial  clusters 

The  second  example  involves  three  artificially  generated  clusters 
consisting  of  100,  75,  and  50  points  in  two  dimensions.  The  clusters 
were  deliberately  constructed  to  have  multivariate  normal  distributions 
with  different  covariance  matrices.  Also,  the  clusters  overlap  somewhat 
more  than  the  iris  species  did.  The  raw  data,  scatter  plots  and  some 
computer  printouts  are  available  in  a  previously  published  report  (Wolfe, 
1965) . 

The  results  of  the  hypothesis  testing  phase  of  the  analysis  are 
given  in  Table  3.  A  routine  application  of  the  x2  likelihood  ratio  test 
would  indicate  the  existence  of  more  than  three  types  in  the  data.  However 
an  examination  of  the  parameter  estimates  for  the  "fourth"  cluster  (not 
shown  here)  revealed  that  NORMIX  had  estimated  the  fourth  cluster's  mixing 
proportion  to  be  .030  -  the  equivalent  of  a  sample  of  seven  points.  The 
chi-square  approximation  is  inaccurate  for  this  sample  size.  Lvidently, 
further  research  is  required  to  develop  appropriate  significance  tests 

3 

for  small  samples. 

The  parameter  estimates  for  three  types  arc  given  in  Tabic  4. 

NORMIX  shows  a  clear  superiority  in  the  accuracy  of  its  estimates,  as 
would  be  expected  in  a  situation  where  the  types  have  unequal  covariance 
matrices . 

■\)ne  possibility  is  a  better  approximation  of  the  form 
2  a 

X  =  -2  log  0/(1  +  — )  where  Q  =  (L  ,/L  ),  n  =  N*A  .  ,  and  a  is  a  constant 
a  s  v  n,»  ^  r- 1  r'  min 

to  be  determined  by  methods  similar  to  those  given  by  Lawley  (1956),  or  if 
necessary  by  Monte  Carlo  methods.  The  examples  in  this  paper  seem  to 
require  a  value  of  a^lO. 
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TABLE  4.  ARTIFICIAL  CLUSTER  PARAMETERS 


nnnia  i —  ■  i—  m  ■  i 


8.  CONCLUDING  RB1ARKS 


By  reformulating  cluster  analysis  as  a  problem  in  estimation  for 
mixed  distributions,  we  hope  to  have  put  the  subject  on  a  more  rigorous 
foundation.  No  "similarities"  or  "distances"  need  to  be  assumed  a  priori. 
The  closest  analogy  to  a  "similarity"  in  our  system  is  the  probability  of 
membership  of  a  point  in  a  cluster,  but  this  probability  is  the  result 
of  an  iterative  solution  to  the  likelihood  equations  rather  than  an  arbi¬ 
trarily  given  function.  The  partitioning  methods  of  cluster  analysis  are 
seen  to  be  approximations  to  maximum-likelihood  solutions,  valid  when  the 
clusters  are  widely  separated.  Mixture  analysis  is  a  "smoothed"  version 
of  cluster  analysis.  The  difficult  combinatorial  problems  entailed  by 
discrete  partitioning  methods  are  avoided  by  weighting  the  sample  obser¬ 
vations  with  a  continuous  probability  of  membership  function. 

The  feasibility  of  the  iterative  solution  of  the  max imum- 1 i kcl  i  hood 
equations  has  been  demonstrated  by  the  examples  in  this  paper.  The 
iterations  involve  re-calculating  the  probabilities  of  membership  of  every 
point  in  each  type.  The  amount  of  computation  is  more  extensive  than 
statisticians  are  accustomed  to,  but  it  is  of  the  same  order  of  magnitude 
as  many  other  cluster  analysis  systems. 

Maximum- 1 ikcl ihood  estimation  procedures  have  been  used  throughout 
the  paper  because  of  the  ease  with  which  they  can  be  generalized  to  multi¬ 
variate  distributions  of  various  forms.  The  general  theory  of  mixtures 
has  been  applied  to  the  multivariate  normal  and  the  multivariate  Bernoulli 
distributions.  Considerably  more  work  needs  to  lie  done  a  applying  the 
theory  to  other  distributions,  in  developing  significance  tests  for  small 


samples,  and  in  finding  confidence  regions,  further  research  is  desirable 
in  the  areas  of  iteration  methods,  initial  estimation  procedures  and 
multiple  solutions.  Nevertheless,  we  believe  that  a  good  beginning  has 
been  made  toward  the  development  of  logical  and  practical  procedures  for 
the  analysis  of  mixtures. 
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Cluster  analysis  is  reformulated  as  a  problem  of  estimating  the  parameters 
of  a  mixture  of  multivariate  distributions.  The  maximum- 1 ikcl  ihood  theory 
and  numerical  solution  techniques  arc  developed  for  a  fairly  general  class 
of  distributions.  The  theory  is  applied  to  mixtures  of  multivariate  normals 
("NORM IX")  and  mixtures  of  multivariate  Bernoulli  distributions 
("Latent  Classes").  The  feasibility  of  the  procedures  is  demonstrated  by 
two  examples  of  computer  solution;  for  normal  mixture  models  of  the  Fisher 
Iris  data  and  of  artificially  generated  clusters  with  unequal  covariance 
matrices . 


